AITopics | target token

Most crucially, they require prohibitively large amounts of additional memory since they rely on backpropagation which allocates almost twice as much GPU memory as the forward pass. This renders it difficult, if not impossible, to use explanations in production.

explanation, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
Asia (0.04)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

8df705957a5262de3cb37ba9f1fb96f3-Paper-Conference.pdf

Neural Information Processing SystemsFeb-15-2026, 19:44:59 GMT

computational linguistic, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
North America > Canada > Ontario > Toronto (0.04)
(14 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)

Add feedback

b597460c506e8e35fb0cc1c1905dd3bc-Paper.pdf

Neural Information Processing SystemsFeb-10-2026, 20:17:37 GMT

correction model, edit alignment, fastcorrect, (13 more...)

Neural Information Processing Systems

Country: Asia > China (0.04)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.74)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Add feedback

194fa4536bf36f35a4505d20cd5dd6fc-Paper-Conference.pdf

Neural Information Processing SystemsFeb-8-2026, 16:45:05 GMT

dataset, information, video token, (15 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Vision > Video Understanding (0.65)

Add feedback

From Projection to Prediction: Beyond Logits for Scalable Language Models

Dong, Jianbing, Chang, Jianbin

arXiv.org Artificial IntelligenceNov-25-2025

Training Large Language Models (LLMs) typically involves a two-stage pipeline at the output layer: hidden states are projected into vocabulary logits via a linear transformation (lm_head), followed by cross-entropy loss computation against target tokens. While conceptually simple, this design incurs substantial overhead. The intermediate logits tensor, with dimensions proportional to batch size, sequence length, and vocabulary size, must be fully materialized in GPU memory, even though only one target token per position is ultimately used. This leads to significant memory footprint and bandwidth comsumption, limiting scalability and slowing training throughput. In this work, we introduce a novel approach to integrates the output projection and loss prediction into a single operation. By directly computing the loss from hidden states and target tokens, our approach bypasses explicit logits materialization. This design reduces memory usage and alleviates bandwidth pressure. Experiments on LLM training demonstrate that our method achieves substantial memory savings and measurable speedups compared to the standard two-stage pipeline, enabling large batch sizes and longer sequences without sacrificing accuracy. Our work highlights the benefits of rethinking the boundary between projection and prediction, offering a practical systems optimization for efficient LLM training.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.17599

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation

Tianyu He, Xu Tan, Yingce Xia, Di He, Tao Qin, Zhibo Chen, Tie-Yan Liu

Neural Information Processing SystemsNov-20-2025, 16:23:12 GMT

Neural Machine Translation (NMT) has achieved remarkable progress with the quick evolvement of model structures. In this paper, we propose the concept of layer-wise coordination for NMT, which explicitly coordinates the learning of hidden representations of the encoder and decoder together layer by layer, gradually from low level to high level.

artificial intelligence, machine learning, natural language, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > Canada > Quebec > Montreal (0.04)
Europe > Germany > Berlin (0.04)
(11 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

Add feedback

Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap

Yang, Chun-Hao, Feng, Bo-Han, Lai, Tzu-Yuan, Chen, Yan Yu, Huang, Yin-Kai Dean, Lin, Shou-De

arXiv.org Artificial IntelligenceNov-4-2025

Optimizing training performance in large language models (LLMs) remains an essential challenge, particularly in improving model performance while maintaining computational costs. This work challenges the conventional approach of training LLMs using next-token prediction (NTP), arguing that by predicting information-rich tokens during training, there is a more effective way to train LLMs. We investigate the impact of the proposed solution in three kinds of tasks for LLMs: arithmetic, multi-label classification of text, and natural-language generation. This work offers a principled approach to optimizing LLM training, advancing both model performance and theoretical understanding of the target-token selection strategies.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2511.00198

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.94)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models

Bajaj, Anooshka, Mistry, Deven Mahesh, Maini, Sahaj Singh, Aggarwal, Yash, Tiganj, Zoran

arXiv.org Artificial IntelligenceOct-28-2025

In-context learning is governed by both temporal and semantic relationships, shaping how Large Language Models (LLMs) retrieve contextual information. Analogous to human episodic memory, where the retrieval of specific events is enabled by separating events that happened at different times, this work probes the ability of various pretrained LLMs, including transformer and state-space models, to differentiate and retrieve temporally separated events. Specifically, we prompted models with sequences containing multiple presentations of the same token, which reappears at the sequence end. By fixing the positions of these repeated tokens and permuting all others, we removed semantic confounds and isolated temporal effects on next-token prediction. Across diverse sequences, models consistently placed the highest probabilities on tokens following a repeated token, but with a notable bias for those nearest the beginning or end of the input. An ablation experiment linked this phenomenon in transformers to induction heads. Extending the analysis to unique semantic contexts with partial overlap further demonstrated that memories embedded in the middle of a prompt are retrieved less reliably. Despite architectural differences, state-space and transformer models showed comparable temporal biases. Our findings deepen the understanding of temporal biases in in-context learning and offer an illustration of how these biases can enable temporal separation and episodic retrieval.

artificial intelligence, large language model, natural language, (17 more...)

arXiv.org Artificial Intelligence

2510.22752

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Filters

Collaborating Authors

target token

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Alleviating Hallucinations in Large Vision-Language Models through Hallucination-Induced Optimization

c83bc020a020cdeb966ed10804619664-Supplemental-Conference.pdf

Understanding Transformer Predictions Through Memory Efficient Attention Manipulation

8df705957a5262de3cb37ba9f1fb96f3-Paper-Conference.pdf

b597460c506e8e35fb0cc1c1905dd3bc-Paper.pdf

194fa4536bf36f35a4505d20cd5dd6fc-Paper-Conference.pdf

From Projection to Prediction: Beyond Logits for Scalable Language Models

Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation

Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap

Beyond Semantics: How Temporal Biases Shape Retrieval in Transformer and State-Space Models